Reetesh Kumar

Fri Nov 11 2022

Wrapping up my journey scaling Software 2.0 development for AV

This is the foundation of what SW 2.0 is, and it is what has guided the entire revolution we’ve been witnessing over the past 10 years. Essentially, anywhere sufficient data is available to capture an association between two domains (e.g. images to text, text to images, text to text, video to car actuation), we’ve been aggressively replacing traditional software (handwritten source + compiler) with Software 2.0 (data + deep learning). In other words, data is the source code of AI. The rate at which this is happening is astounding. Now we’re essentially feeding raw recordings of surround sensor data from a vehicle and letting DL fully learn the mapping to high-level 3D/4D symbols fully describing scenes. Or feeding raw associations between text and imagery and letting models fully learn to associate a plain-English prompt with a possible image rendering.

Many believe that Software 2.0, as described above, is now eating the world, just like Software 1.0 was eating the world in 2011 (original quote by Andreessen). I certainly do. Some believe we’re missing key ingredients to get to AGI, but that is a different debate: the technology powering SW 2.0 still has a lot of runway, keeps getting better as models are grown in size (# of parameters) and fed more data, and a wave of very concrete innovations is being derived from it as it is. In other words, we have a technology that’s ripe to enable amazing new products, and we’re already in the middle of this cycle.
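To make the contrast concrete, here’s a toy sketch of my own (not from any production system): a SW 1.0 function whose behavior is hand-written, next to a SW 2.0 version whose behavior comes entirely from {input, output} pairs. The feature, the threshold, and the tiny dataset are all made up for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # SW 1.0: the behavior lives in hand-written rules an engineer picks and maintains.
    def is_brake_light_on_v1(mean_red_intensity: float) -> bool:
        return mean_red_intensity > 0.7          # hand-tuned threshold

    # SW 2.0: the behavior is learned from labeled {input, output} pairs.
    # Toy dataset: one feature (mean red intensity), label = brake light on/off.
    X = np.array([[0.10], [0.30], [0.55], [0.72], [0.80], [0.95]])
    y = np.array([0, 0, 0, 1, 1, 1])
    model = LogisticRegression().fit(X, y)       # the data is the "source code"

    def is_brake_light_on_v2(mean_red_intensity: float) -> bool:
        return bool(model.predict([[mean_red_intensity]])[0])

    print(is_brake_light_on_v1(0.9), is_brake_light_on_v2(0.9))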

SW 2.0 in AV

Now that we’ve set the stage for SW 2.0, I want to zoom into Autonomous Vehicles (AV) and how SW 2.0 powers their development. Why this topic? Well, this is where I’ve spent most of my focus over the past few years, and though I worked to support several other domains, this was the most complex, most challenging, and most cross-functional problem I’ve encountered in my life.

First, some respect for the problem. Enabling vehicles to drive autonomously means enabling them to build a full understanding of their environment, spatially, in the present and into the future, and then leverage that understanding to plan and actuate the vehicle (accelerate/brake and steer).
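Schematically, the task looks something like the loop below. This is purely a sketch to frame the rest of the discussion; every name in it is a hypothetical placeholder, not a real stack.

    # Illustrative skeleton of the autonomy task: understand the surroundings,
    # reason about the future, then plan and actuate. All names are placeholders.

    def perceive(raw_sensor_data):
        """Map raw sensor vectors to a 3D/4D representation of the scene."""
        ...

    def predict(scene):
        """Estimate likely intents and trajectories of the other actors."""
        ...

    def plan(scene, predicted_futures):
        """Choose a safe, comfortable trajectory for the ego vehicle."""
        ...

    def drive_step(sensors, vehicle):
        raw = sensors.read()                 # cameras, radar, lidar, ... as raw vectors
        scene = perceive(raw)
        futures = predict(scene)
        trajectory = plan(scene, futures)
        vehicle.actuate(trajectory)          # accelerate/brake and steer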

The problem is daunting for a few reasons:

- Building this understanding of the environment would be trivial if the environment were finite and a set of sensors existed that could map it into a reliable internal representation. But of course the environment is our physical world, with all the diversity you can imagine across roads, vehicles, weather, intersections, buildings, construction sites, traffic lights, signs, etc. And there is no sensor that can reliably transform those real-world attributes into an internal representation directly: all you have is cameras, ultrasonics, lidars, radars, microphones, etc., each giving you a clue about the physical world by transforming it into a raw vectorial representation that needs tremendous work to make sense of (that’s where DL comes in, but more on that next).

- Even if the environment were more contained (a finite diversity of roads, buildings, etc.), the dimensionality of the sensory input is staggering. You end up requiring 10–30 unique sensors distributed around the vehicle to get enough sensory input and diversity to have a chance at solving the problem. The distribution of sensors varies across the industry: on one extreme, the choice to have fewer of them (e.g. cameras only) and bet on solving the problem entirely via DL; on the other extreme, the choice to have many of them (cameras, ultrasonics, radar, lidar) and fuse them to achieve the same result, hopefully with less data, since each sensor senses different aspects of the physical world. Either way, your input vector at any instant is incredibly high-dimensional, rich, and “raw” compared to the representation you need to create to enable actuation.

- Even if the sensor input were constrained and you had a simpler way to get a representation of the surrounding environment, the dimensionality of the state space is crazy high: think of all the actors you encounter at any one time, what their intent is (is that pedestrian going to cross the road, or read their iPhone and not move?), and the distribution of possible futures. In other words, even with perfect perception, the problem of predicting all possible futures, planning through them, and ultimately actuating is damn hard.

There are many other aspects, but I’ll focus on just those for now. To recap them concisely:

- Dimensionality of the input space (how diverse the physical world is)
- Dimensionality of the sensor space (how large and diverse the sensors are)
- Dimensionality of the state space (how many actors and possible actions)
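To give a rough sense of scale for the first two of those, here’s a back-of-envelope calculation. Every number in it is a hypothetical round figure picked for illustration, not the spec of any particular vehicle.

    # Back-of-envelope: raw input dimensionality per "frame" for a hypothetical
    # sensor suite vs. the compact scene representation actuation actually needs.
    # All numbers are illustrative assumptions, not from a real platform.

    cameras       = 8 * 1920 * 1080 * 3    # 8 RGB cameras
    lidar_points  = 2 * 100_000 * 4        # 2 lidars, ~100k returns x (x, y, z, intensity)
    radar_returns = 5 * 1_000 * 4          # 5 radars, ~1k returns x (range, azimuth, velocity, rcs)
    ultrasonics   = 12                     # 12 range readings

    raw_dims = cameras + lidar_points + radar_returns + ultrasonics

    # A compact output: up to 100 tracked objects x ~10 numbers each
    # (class, 3D position, size, heading, velocity, ...).
    scene_dims = 100 * 10

    print(f"raw input dims per frame : {raw_dims:,}")               # ~50 million
    print(f"scene representation dims: {scene_dims:,}")             # 1,000
    print(f"ratio                    : {raw_dims // scene_dims:,}x")  # ~50,000x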

How do we go about solving this? That’s where Deep Learning (DL) comes in, and the SW 2.0 paradigm associated with it. There is simply no way to manually program a computer (SW 1.0) to fully represent the mapping between the raw input space, as presented to us via the sensors we’ve picked, and the final actuations that need to occur. Plain undoable. Instead, we rely on DL to learn the mapping between that raw sensor data and the representation we need to enable actuation. Here’s how it goes:

- Input: data acquired by a set of sensors (cameras, radar, etc.). These are typically fairly low-level, raw vectors that capture attributes of the world. In the case of cameras, a 2D projection of the 3D physical world in the visible spectrum. In the case of lidar, a 2D projection of active laser beams projected onto 3D surfaces and sensed back, giving a sense of distance to elements in the world.
- Output: a single, unified representation of the entire surroundings of the vehicle, describing each relevant object in 3D, with temporal information (direction + velocity vector, etc.).
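To make that pair concrete, here is one possible, heavily simplified shape for a single training example. The class and field names are my own illustrative assumptions, far simpler than anything a real system would use.

    from dataclasses import dataclass, field
    from typing import List
    import numpy as np

    # One heavily simplified shape for a single {input, output} pair.
    # Field names and shapes are illustrative assumptions only.

    @dataclass
    class SensorFrame:                    # the input: raw and high-dimensional
        camera_images: List[np.ndarray]   # e.g. 8 arrays of shape (H, W, 3)
        lidar_points: np.ndarray          # e.g. (N, 4): x, y, z, intensity
        radar_returns: np.ndarray         # e.g. (M, 4): range, azimuth, velocity, rcs

    @dataclass
    class TrackedObject:                  # one element of the output
        category: str                     # car, pedestrian, cyclist, ...
        position_m: np.ndarray            # (3,) position in the ego frame
        size_m: np.ndarray                # (3,) length, width, height
        velocity_mps: np.ndarray          # (3,) direction + speed

    @dataclass
    class SceneLabel:                     # the output: compact, unified scene description
        objects: List[TrackedObject] = field(default_factory=list)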

Learning that representation is… a challenge :). For those used to training DL models, you’ll immediately guess what’s required:

1. A great dataset representing pairs of {input, output} as above, to start with
2. A great infrastructure to train models off of the current dataset, and to test them
3. A great infrastructure to continuously refine the dataset based on the latest model
4. A great team and culture to iterate over the two points above as many times as possible :)
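Put together, those four ingredients form one iteration loop. Below is my own schematic of that loop; every function in it is a made-up placeholder standing in for real infrastructure, not a description of any specific system.

    # Schematic of the SW 2.0 iteration loop implied by the list above.
    # Every function is a placeholder standing in for real infrastructure.

    def train(dataset):
        """Point 2: fit a model on the current dataset."""
        return {"trained_on": len(dataset)}            # placeholder "model"

    def evaluate(model, test_suite):
        """Point 2: test the model on held-out data and scenario suites."""
        return {"num_tests": len(test_suite)}          # placeholder metrics

    def mine_hard_cases(model, fleet_logs):
        """Point 3: find inputs where the latest model struggles."""
        return fleet_logs[:5]                          # placeholder selection

    def label(samples):
        """Points 1 and 3: turn raw samples into {input, output} pairs."""
        return [(s, "label") for s in samples]         # placeholder labels

    def development_cycle(dataset, test_suite, fleet_logs, iterations=3):
        """Point 4: iterate the loop as many times as possible."""
        model, metrics = None, None
        for _ in range(iterations):
            model = train(dataset)
            metrics = evaluate(model, test_suite)
            dataset = dataset + label(mine_hard_cases(model, fleet_logs))
        return model, metrics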

Many other problems across industries have similar attributes and needs: in other robotics applications of course, but also in medical imaging, in recommendation systems, etc.
